Support for cloud storage
fastdup supports two types of cloud storage:
- Amazon AWS S3 using awscli
- Min.io cloud storage API
Amazon S3 aws cli support
Preliminaries:
- Install aws cli using the command
sudo apt install awscli
- Configure your aws credentials using the command
aws configure
- Make sure you can access your bucket using
aws s3 ls s3://<your bucket name>
How to run
There are two options to run:
- In the input_dir command line argument put the full path to your bucket, for example:
s3://mybucket/myfolder/myother_folder/
This option is useful for testing, but it is not recommended for large corpora of images, as listing files in s3 is a slow operation. In this mode, all the images in the recursive subfolders of the given folder will be used.
- Alternatively (and recommended), create a file with the list of all your images, one full s3:// path per line (see the sketch after this list). Assuming the file is saved as
files.txt
you can run with input_dir='/path/to/files.txt'
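As an illustration (the bucket and image names are hypothetical), files.txt could contain:
s3://mybucket/myfolder/image1.jpg
s3://mybucket/myfolder2/image2.jpg
s3://mybucket/myfolder3/image3.jpg
A minimal sketch of running fastdup on it from Python, assuming fastdup's run API with input_dir and work_dir parameters:
import fastdup

# files.txt holds one full s3:// path per line; work_dir is where fastdup writes its outputs.
fastdup.run(input_dir='/path/to/files.txt', work_dir='fastdup_work')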
Using AWS endpoints
fastdup supports the use of a custom AWS endpoint for accessing data stored on S3. To enable it, set the
FASTDUP_S3_ENDPOINT_URL
environment variable to the endpoint's URL, either by exporting it in the shell before running fastdup or by setting it from within Python (see the sketch below).
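A minimal sketch of the Python route; the endpoint URL below is a placeholder for your own S3-compatible endpoint:
import os

# Hypothetical endpoint URL; must be set before fastdup is invoked.
os.environ['FASTDUP_S3_ENDPOINT_URL'] = 'https://s3-endpoint.example.com'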
Notes:
- Currently we support a single cloud provider and a single bucket.
- It is OK to have images with the same name as long as they are nested in different subfolders.
- In terms of performance, if the local disk is large enough it is better to first copy the full bucket to the local node and then give the
input_dir
as the local folder location of the copied data (see the sketch below). The remote-access flow described above is for the case where the dataset is larger than the local disk (and potentially multiple nodes running in parallel).
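For the copy-first flow, a short sketch (the bucket name and local path are hypothetical): sync the bucket to the local disk, e.g. with aws s3 sync s3://mybucket /data/mybucket, then point fastdup at the local copy:
import fastdup

# Run on the local copy of the bucket instead of the remote s3 path.
fastdup.run(input_dir='/data/mybucket', work_dir='fastdup_work')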
Min.io support
Preliminaries
- Install the min.io client, for example on Linux:
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
- Configure an alias for your cloud storage provider, for example for google cloud (the access and secret keys are placeholders):
./mc alias set google https://storage.googleapis.com <access key> <secret key>
How to run
There are two options to run:
- In the input_dir command line argument put the full path to your cloud storage provider as defined by the minio alias, for example:
minio://google/mybucket/myfolder/myother_folder/
(Note that google is the alias set for google cloud, and the path has to start with the minio:// prefix.)
This option is useful for testing, but it is not recommended for large corpora of images, as listing files remotely is a slow operation. In this mode, all the images in the recursive subfolders of the given folder will be used.
- Alternatively (and recommended), create a file with the list of all your images, one full minio:// path per line, in the same format as described for S3 above. Assuming the file is saved as
files.txt
you can run with input_dir='/path/to/files.txt'
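A minimal sketch of a min.io run from Python, again assuming fastdup's run API; google is the alias configured in the preliminaries and the bucket path is hypothetical:
import fastdup

# The path must start with the minio:// prefix, followed by the alias.
fastdup.run(input_dir='minio://google/mybucket/myfolder/', work_dir='fastdup_work')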